All about Git

Lecture 8

Jay Paul Morgan

Version Control Systems

When we’re programming by ourselves, we often just make changes to the program as we need them and move on. But what if we’re not the only person making changes? For example, there are thousands of developers contributing to large open-source projects like the Linux kernel, deep learning frameworks such as PyTorch or TensorFlow, and programming languages such as Python. How do we manage the changes from all of these independent developers while keeping track of what’s changed?

This is one of the roles of version control systems, often abbreviated to VCS. A version control system is an additional layer of software over our programming code that allows us to ‘checkpoint’ the program code at a specific point in time. Moreover, it can help ‘merge’ changes from different developers, so that the changes made by one developer do not unintentionally overwrite the changes made by a different developer.

Git, developed by Linus Torvalds in 2005, is one such version control system, and it is the most ubiquitous at the time of writing. It has surpassed many earlier version control systems, and while many new ones have been proposed, none have been successful (yet) at unseating Git from its throne as the leader of VCSs.

In this lecture, we’ll learn how to set up and use Git in our projects.

Installing Git

Depending on the system you’re using, Git may or may not already be installed. If you’re using a Debian-based operating system, and you don’t have Git installed, you’ll want to use apt to install it:

sudo apt install git

Note: You’ll need super-user privileges to install Git in this way.

If you’re using a different operating system, you can refer to Git’s own documentation for each supported system: https://git-scm.com/downloads

We can check that Git is installed by running the following command in the terminal:

type git

This, running on my computer, returns the path to the executable:

git is /usr/bin/git

You’ll probably see something similar on your machine.
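As a small sketch of the same check written for a script rather than typed by hand, we can use command -v (a standard shell built-in that prints the path of a command, or nothing if it is missing) and then ask Git for its version. The variable name git_path is just an illustrative choice.

```shell
# Look up git on the PATH; command -v prints nothing if git is missing.
git_path=$(command -v git)

if [ -n "$git_path" ]; then
    echo "git found at: $git_path"
    # Print the installed version, e.g. to compare against the documentation.
    git --version
else
    echo "git is not installed"
fi
```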

Now we’re ready to start using Git!

Setting up a Git Repository

Let’s imagine we’re starting a new project. All of our program scripts for this project are going to go into a single folder I’ve named ‘my-new-project’.

So far, this directory is empty:

my-new-project % ls -lha
total 0
drwxr-xr-x   2 jaypaulmorgan  wheel    64B 12 Nov 12:59 .
drwxrwxrwt  16 root           wheel   512B 12 Nov 12:59 ..

Even though we have no files yet, we can initialise a Git repository for this directory (and this will work for directories that already have existing files as well) by using the init sub-command:

git init
Initialized empty Git repository in my-new-project/.git/

If successful, this command will tell us that it’s created a Git repository. Formally, it has created a .git folder within our project folder that contains the information Git keeps about this repository, such as its history and configuration.

A repository in this context is the directory of things that are going to be tracked by Git. In the context of this lecture, we’ll often use repository and directory interchangeably.

Now if we list the files and folders in this directory again, we should see there is only one new folder, the .git folder that Git told us it had created.

my-new-project % ls -lha
total 0
drwxr-xr-x   3 jaypaulmorgan  wheel    96B 12 Nov 13:01 .
drwxrwxrwt  16 root           wheel   512B 12 Nov 12:59 ..
drwxr-xr-x   9 jaypaulmorgan  wheel   288B 12 Nov 13:01 .git

With Git now initialised in our project folder, we can start our programming!
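If you’d like to try initialising a repository without touching a real project, here is a minimal sketch that works in a throw-away directory. The use of mktemp and the variable name repo are assumptions for the sake of the example, not part of the project above.

```shell
# Create a scratch directory so we do not touch a real project.
repo=$(mktemp -d)
cd "$repo"

# -q suppresses the "Initialized empty Git repository" message.
git init -q

# The .git folder now exists alongside the (still empty) project files.
ls -a

# Ask Git whether the current directory belongs to a repository.
inside=$(git rev-parse --is-inside-work-tree)
echo "inside a work tree: $inside"
```

Deleting the scratch directory afterwards removes the repository completely, since everything Git stores lives inside .git.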

Staging files

After some time of programming, we’ve managed to create a couple of files and folders.

Let’s list these out (using the tree command so it looks nice):

my-new-project % tree
.
├── Makefile
├── README.md
└── src
    └── main.cpp

2 directories, 3 files

In this hypothetical scenario, we’ve been programming and we’ve created a main.cpp C++ file, and a Makefile to specify how to compile the program. We’ve also written a README.md markdown file that tells other developers how to use the program.

But we haven’t checkpointed these files. What does checkpoint mean here? Well, if we make any further changes to the program, we’ll no longer be able to get back to the project as it currently is. By checkpointing the program in its current state, even if we make some changes, we’ll still be able to come back to this checkpoint at a later point in time.

To start checkpointing, or as Git calls it committing, our files, we can first look at the current status of these files using the aptly-named sub-command status:

my-new-project % git status
On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
    Makefile
    README.md
    src/

nothing added to commit but untracked files present (use "git add" to track)

Here we see that we’re on the branch main (we’ll come back to this at a later point in the lecture). We don’t have any commits yet, i.e. no checkpoints we can revert to. And, finally, we have some untracked files, i.e. all of the files we’ve created at this point.

Helpfully, Git has told us that if we want to start tracking the changes to the files, we should use the add sub-command to stage the files ready for committing.

This is a perfect time to talk about the different statuses that each of the files could be in.

sequenceDiagram
    participant Unchanged
    participant Untracked/Modified
    participant Staged
    participant Committed
    Untracked/Modified->>Staged: git add <filename/directory>
    Staged->>Committed: git commit
    Committed->>Unchanged: A new checkpoint has been made
    Staged->>Untracked/Modified: git rm --cached <filename/directory>

In this diagram we show there are various ‘states’ each of the files or folders could be in. To commit a file or a folder, we’ll first want to stage the changes with

git add <file/folder>

and then when all the changes we want to commit are staged, we can finalise the commit using:

git commit

Let’s use these on our project now.

my-new-project % git add .
my-new-project % git status
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   Makefile
    new file:   README.md
    new file:   src/main.cpp

We first used git add . to add all files in the current directory (as denoted by ‘.’, which means the current directory in Linux), and then re-ran the git status command, which shows that all of our files are now ready to be committed, i.e. they are staged!

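If we wanted to remove a file from the staging area, we could use git rm --cached <name-of-file-or-folder> to un-stage it. Note this doesn’t delete the file/folder from disk, it just un-stages it. This applies to newly added files like ours. For a file that is already part of an earlier commit, git rm --cached would instead stage its removal from the repository, and in that situation recent versions of Git suggest git restore --staged <file> in the git status output.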
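To watch a file move between these states, here is a small sketch in a scratch repository. The file name notes.txt and the variable names are made up for the example. It uses git status --short, a compact form of the status output in which ?? marks an untracked file and A marks a newly staged one.

```shell
# Scratch repository so the experiment is isolated.
repo=$(mktemp -d)
cd "$repo"
git init -q

echo "draft" > notes.txt

# Untracked: the short status shows "?? notes.txt".
git status --short

# Staged as a new file: the short status shows "A  notes.txt".
git add notes.txt
staged=$(git status --short)
echo "$staged"

# Un-staged again; the file itself stays on disk.
git rm --cached -q notes.txt
unstaged=$(git status --short)
echo "$unstaged"
```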

Now, with the files in the staging area, we’re ready to commit them. For this, we use the git commit command. This will bring up a text editor where we can write what, in general, has changed since the last commit. This is called the commit message. When you’re done describing the changes that have been made, just save and exit the file.

There is a shorthand way to do this using the -m flag:

git commit -m 'This is my commit message'

After committing we’ll get some output about what’s been checkpointed:

[main (root-commit) b587617] This is my commit message
 3 files changed, 254 insertions(+)
 create mode 100644 Makefile
 create mode 100644 README.md
 create mode 100644 src/main.cpp

Finally, if we run git status again, we’ll see there are no changes since we last committed (this makes sense since we just committed all of our changes).

my-new-project % git status
On branch main
nothing to commit, working tree clean
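The whole stage-and-commit cycle can be sketched end to end in a scratch repository. One assumption worth flagging: Git may refuse to commit if it does not know who you are, so the sketch sets a placeholder name and email (both invented here) in the scratch repository’s own configuration.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Placeholder identity, stored only in this scratch repository.
git config user.name "Example Student"
git config user.email "student@example.com"

echo "# My New Project" > README.md

# Stage the file, then checkpoint it with a message.
git add README.md
git commit -q -m 'Add README'

# Count the commits and collect any changes that are still pending.
commit_count=$(git rev-list --count HEAD)
pending=$(git status --short)
echo "commits: $commit_count"
echo "pending changes: ${pending:-none}"
```

On your own machine you would normally set your real name and email once with git config --global, so that every repository picks them up.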

Log

As we make more changes to the project, we’ll want to look at the history of changes to see what has changed and when.

There is a simple sub-command named log that will allow us to see a sequential list of changes to the project.

First, though, let’s make some more changes so that there is more history to look at:

my-new-project % git status
On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   src/main.cpp

no changes added to commit (use "git add" and/or "git commit -a")

We’ve made a change to main.cpp, fixing a bug in the process. Now that we’re happy with the current state of the project (there are no more bugs that we’re currently aware of, but don’t worry there will always be more!), we want to create another commit.

my-new-project % git add src/main.cpp 
my-new-project % git commit -m 'Fix reading file bug caused by typo'
[main 1a5d58e] Fix reading file bug caused by typo
 1 file changed, 1 deletion(-)

Now if we use the command git log, we’ll see a sequential view of how the project has changed (from the perspective of checkpoints).

my-new-project % git log
commit 1a5d58e3e0c7796f7c5eb77083a6773f158d48b8 (HEAD -> main)
Author: Jay Paul Morgan <email@email.com>
Date:   Sun Nov 12 13:38:47 2023 +0100

    Fix reading file bug caused by typo

commit b587617fed362faec952718785112d3e7d32b038
Author: Jay Paul Morgan <email@email.com>
Date:   Sun Nov 12 13:28:23 2023 +0100

    This is my commit message

Here we see that older commits are at the bottom, and the most recent at the top. If we have many commits, we can scroll through them using the up and down arrows on our keyboard, and q to quit.
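When the history grows, the full output can be a lot to read. As a sketch, the --oneline option prints one line per commit, and -n limits how many commits are shown. The scratch repository, file name, commit messages and placeholder identity below are all invented for the example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m 'First commit'

echo "two" >> file.txt
git add file.txt
git commit -q -m 'Second commit'

# One line per commit (abbreviated hash and message), newest first.
git log --oneline

# Only the subject line of the most recent commit.
latest=$(git log -n 1 --format=%s)
echo "latest: $latest"
```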

Branches

So far, we’ve been committing directly to the main branch (or master if you’re working with an older Git repository). If you’re a solo developer this is perhaps okay, but the main branch is meant to represent a working state of the program, and while we’re part-way through a change the program is potentially in a non-working state. Philosophically speaking, this is not ideal. Our preference as good and organised developers, which we all aspire to be, is to make changes in a separate branch. Then, when we’re happy with the changes and have a new working version of the program, we merge that branch back into the main branch.

gitGraph
   commit
   commit
   branch develop
   checkout develop
   commit
   commit
   checkout main
   merge develop
   commit
   commit

This type of workflow is the basis of branching models such as git-flow. In essence, for every new feature of the program we’re adding, we’ll create a new branch, and then merge back into the main branch when it’s complete. The depths of this concept are not necessary for this lecture, but if you’re interested, please do read https://www.gitkraken.com/learn/git/git-flow for an introduction to this workflow.

Nevertheless, we’ll still want to learn how to create new branches. At this point, our project looks like:

gitGraph
    commit
    commit

We’ll want to make some more changes, but only in a new branch other than main.

Creating & Checking out Branches

To create a new branch use the branch sub-command, specifying the name of the new branch:

my-new-project % git branch develop
my-new-project % git branch
  develop
* main

Here, we’ve created a new branch develop, and listed all of the existing branches using git branch.

The asterisk (*) next to the branch name tells us what branch we’re currently on.

To change branches, we check out the branch we want to move to:

my-new-project % git checkout develop
Switched to branch 'develop'
my-new-project % git branch
* develop
  main

Our project now looks like:

gitGraph
    commit
    commit
    branch develop
    checkout develop

Git will tell us when it’s changing branches as you see in the above command.
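Creating a branch and switching to it are so often done together that git checkout has a -b option to do both in one step. Here is a sketch in a scratch repository. The file name and placeholder identity are invented, and git branch -M main is only there to make sure the first branch is called main whatever your Git defaults are.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m 'Initial commit'

# Make sure the starting branch is named main.
git branch -M main

# Create develop and switch to it in a single step.
git checkout -q -b develop

# Ask Git which branch we are on, then list all branches.
current=$(git rev-parse --abbrev-ref HEAD)
echo "now on: $current"
git branch
```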

Now that we’re on the new branch, we can start to make some changes, stage, and commit them:

my-new-project % git status
On branch develop
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   src/main.cpp

no changes added to commit (use "git add" and/or "git commit -a")
my-new-project % git add src/main.cpp  
my-new-project % git commit -m 'Add new feature'
[develop a59f1a7] Add new feature
 1 file changed, 1 insertion(+)

Now our project looks like:

gitGraph
    commit
    commit
    branch develop
    checkout develop
    commit

Merging Branches

Our ‘feature’ is complete, and we have a working state of the program, so we’ll want to merge this new feature back into the main branch.

First, we’ll check out the main branch:

my-new-project % git checkout main
Switched to branch 'main'

And now, we’ll merge the develop branch into the main branch using the merge sub-command:

my-new-project % git merge develop
Updating 1a5d58e..a59f1a7
Fast-forward
 src/main.cpp | 1 +
 1 file changed, 1 insertion(+)

Our project now looks like:

gitGraph
    commit
    commit
    branch develop
    checkout develop
    commit
    checkout main
    merge develop
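The word Fast-forward in the merge output deserves a brief note. Because main had no new commits of its own since develop branched off, Git does not need to combine anything: it simply moves main forward to the latest commit on develop. The sketch below reproduces this in a scratch repository (file name, messages and identity are invented) and checks that both branches end up on the same commit.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"

echo "base" > file.txt
git add file.txt
git commit -q -m 'Initial commit'
git branch -M main

# Do the feature work on develop.
git checkout -q -b develop
echo "feature" >> file.txt
git commit -q -a -m 'Add new feature'

# Back on main, merge develop; main has not moved, so this fast-forwards.
git checkout -q main
git merge -q develop

main_tip=$(git rev-parse main)
develop_tip=$(git rev-parse develop)
echo "main:    $main_tip"
echo "develop: $develop_tip"
```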

Merge Conflicts

Sometimes, though, the branches cannot automatically be merged together. This can happen when the branches being merged have edited the same piece of text. Which edits does Git keep when merging? It’s a piece of software, not a mind-reader! It can’t know the answer to this question so we have to tell Git what to keep and what to throw away to complete the merging process.

So, let’s imagine we’re trying to merge two branches that have edited the same text. I’ve created this scenario by editing the title of the README file in two branches and then trying to merge them. This is what happened:

my-new-project % git merge develop
Auto-merging README.md
CONFLICT (content): Merge conflict in README.md
Automatic merge failed; fix conflicts and then commit the result.

Git is telling me: “I can’t automatically merge these two branches because they’ve edited the same thing. Tell me what to keep and then we can carry on.”

So we’ll do just that. If we open up the README.md file mentioned in the merge conflict message, we’ll see:

<<<<<<< HEAD 
# Deep Learning 
======= 
# C++ Examples of Deep Learning 
>>>>>>> develop 

Everything between <<<<<<< HEAD and ======= is what’s on the branch we’re currently on (HEAD, which here is main), while everything between ======= and >>>>>>> develop is the content coming from the branch being merged in.

Let’s say that we prefer what is in the develop branch, then we’ll remove (just by deleting in your text editor of choice) everything from <<<<<<< HEAD to ======= and then remove >>>>>>> develop so that our file now looks like this:

# C++ Examples of Deep Learning 

In essence, we’ve kept the parts of the file we wanted in the process of merging, and removed the parts we didn’t want, in addition to removing the <<<<<<<, ======= and >>>>>>> markers.

Now we can save this file and commit the changes, thus completing the merge.

my-new-project % git status
On branch main
You have unmerged paths.
  (fix conflicts and run "git commit")
  (use "git merge --abort" to abort the merge)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
    both modified:   README.md

no changes added to commit (use "git add" and/or "git commit -a")
my-new-project % git commit -a -m 'Resolve conflicts'
[main c8efca4] Resolve conflicts

Our git history will now look something like:

gitGraph
    commit
    commit
    branch develop
    checkout develop
    commit
    checkout main
    merge develop
    commit
    checkout develop
    commit
    checkout main
    merge develop
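The conflict scenario can be reproduced from scratch, which is a safe way to practise resolving one. In this sketch (scratch repository, starting title and identity are invented) both branches rewrite the README title, the merge stops, and we keep the version from develop. Instead of editing the markers by hand, it uses git checkout --theirs, which during a merge takes the whole file from the branch being merged in. That is a shortcut that only suits cases where you want one side in its entirety.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"

echo "# My Project" > README.md
git add README.md
git commit -q -m 'Add README'
git branch -M main

# Each branch rewrites the same title line.
git checkout -q -b develop
echo "# C++ Examples of Deep Learning" > README.md
git commit -q -a -m 'Retitle on develop'

git checkout -q main
echo "# Deep Learning" > README.md
git commit -q -a -m 'Retitle on main'

# The merge stops with a conflict, so git merge exits with a non-zero status.
git merge develop || echo "merge stopped: conflict to resolve"

# Keep the version from the branch being merged in, then conclude the merge.
git checkout --theirs README.md
git add README.md
git commit -q -m 'Resolve conflicts'

title=$(cat README.md)
echo "$title"
```

If a merge goes wrong part-way, git merge --abort (as hinted in the status output above) returns the repository to how it was before the merge started.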

Remote Repositories

When we want to share our project with the world (by sharing the source code), we can host the code in a remote Git repository.

There are a number of popular websites for this:

- GitHub
- GitLab
- Codeberg
- SourceHut

How to use each of these websites is outside the scope of this lecture, since each requires its own specific instructions, so I recommend you read the documentation for your website of choice. For example, if you wanted to use GitHub, there is a Getting Started guide.
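The Git commands involved are the same whichever website you choose, and we can try them without an account. In this sketch a local ‘bare’ repository (one with no working files) stands in for the hosting website, which is an assumption made purely so that the example is self-contained. With a real service you would pass the URL it gives you to git remote add instead of a local path. The name origin is only the conventional name for the main remote.

```shell
work=$(mktemp -d)
remote=$(mktemp -d)

# A bare repository has no working files; here it plays the role of the server.
git init -q --bare "$remote"

cd "$work"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"

echo "# My New Project" > README.md
git add README.md
git commit -q -m 'Add README'
git branch -M main

# Register the remote under the conventional name origin and publish main.
git remote add origin "$remote"
git push -q -u origin main

# List the remotes this repository knows about.
git remote -v
```

The -u option records origin/main as the upstream of main, so later on a plain git push or git pull knows where to go.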